Web-Based Bengali News Corpus for Lexicon Development and POS Tagging
نویسندگان
چکیده
Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing (NLP) applications. The rapid development of these resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpus. We have used a Bengali news corpus, developed from the web archive of a widely read Bengali newspaper. The corpus contains approximately 34 million wordforms. This corpus is used for lexicon development without employing extensive knowledge of the language. We have developed the POS taggers using Hidden Markov Model (HMM) and Support Vector Machine (SVM). The lexicon contains around 128 thousand entries and a manual check yields the accuracy of 79.6%. Initially, the POS taggers have been developed for Bengali and shown the accuracies of 85.56%, and 91.23% for HMM, and SVM, respectively. Based on the Bengali news corpus, we identify various word-level orthographic features to use in the POS taggers. The lexicon and a Named Entity Recognition (NER) system, developed using this corpus, are also used in POS tagging. The POS taggers are then evaluated with Hindi and Telugu data. Evaluation results demonstrates the fact that SVM performs better than HMM for all the three Indian languages.
منابع مشابه
Lexicon Development and POS Tagging Using a Tagged Bengali News Corpus
Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing(NLP) application areas. The rapid development of these resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper...
متن کاملMaximum Entropy Based Bengali Part of Speech Tagging
Part of Speech (POS) tagging can be described as a task of doing automatic annotation of syntactic categories for each word in a text document. This paper presents a POS tagger for Bengali using the statistical Maximum Entropy (ME) model. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various POS cl...
متن کاملVoted Approach for Part of Speech Tagging in Bengali
Part of Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate syntactic category called part of speech. POS tagging is a very important preprocessing task for language processing activities. In this paper, we report about our work on POS tagging for Bengali by combining different POS tagging systems using three weighted voting techniques. The individual POS t...
متن کاملWeakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages
This paper examines unsupervised approaches to part-of-speech (POS) tagging for morphologically-rich, resource-scarce languages, with an emphasis on Goldwater and Griffiths’s (2007) fully-Bayesian approach originally developed for English POS tagging. We argue that existing unsupervised POS taggers unrealistically assume as input a perfect POS lexicon, and consequently, we propose a weakly supe...
متن کاملNamed Entity Recognition from Bengali Newspaper Data
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormous. It is therefore essential to extract event intelligently from it. The progress in technologies in natural language processing (NLP) for information extraction that is used to locate and classify content in news data according to predefined categories such as person name, place name, ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Polibits
دوره 37 شماره
صفحات -
تاریخ انتشار 2008